In [1]:
Author = "Aaron Stephenson"
ASUid = "1222366145"
# Success in the Video Game Industry

Introduction¶

The gaming industry generates billions of dollars in revenue each year, and it is often very difficult to stay ahead of the competition in anticipating what will be popular and when. Game developers and publishers need to know whether a game will perform well before investing significant resources into its development and marketing. The problem we want to tackle is predicting the success a video game will have in the market. The "Video Game Sales" dataset, found on Kaggle, will be used for modeling purposes. It covers games from 1980-2016, including each game's name, platform, release year, genre, publisher, and global sales in millions of units. The goal is to build a machine learning model that can predict the success of a video game based on its characteristics. This type of information could help game developers and publishers make informed decisions about which games to invest vast resources into, and with proper modeling it can help optimize their marketing and distribution strategies.
In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import tensorflow as tf

import sklearn

#confusion_matrix and accuracy_score may come in handy
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score


#KNN Classifier and Regression models
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression


#functions to split and scale our data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

#Functions to test my findings 
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import mean_squared_error, r2_score

#Allows us to exhaustively search for best hyperparameter using cross-validation
from sklearn.model_selection import GridSearchCV

#Note: keras.wrappers.scikit_learn has been removed from recent Keras releases
#(the scikeras package now provides KerasRegressor); this import is unused below
from keras.wrappers.scikit_learn import KerasRegressor
from keras.models import Sequential
from keras.layers import Dense
To begin, we loaded the video game sales dataset, which contains information on video games released from 1980 to 2016, including data on the game's name, platform, release year, genre, publisher, and global sales in millions of units. Before proceeding with any analysis, we cleaned the data by dropping rows with missing values and renaming the "Year_of_Release" column to "Year" for ease of use.
In [3]:
df = pd.read_csv('Game_Success.csv')
In [4]:
# Clean the dataset by dropping missing values
df = df.dropna()

#Rename column for easier future use
df = df.rename(columns = {"Year_of_Release": "Year"})

df
Out[4]:
Name Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score Critic_Count User_Score User_Count Developer Rating
0 .hack//Infection Part 1 2002 Role-Playing Atari 0.49 0.38 0.26 0.13 1.27 75 35 8.5 60 CyberConnect2 T
1 .hack//Mutation Part 2 2002 Role-Playing Atari 0.23 0.18 0.20 0.06 0.68 76 24 8.9 81 CyberConnect2 T
2 .hack//Outbreak Part 3 2002 Role-Playing Atari 0.14 0.11 0.17 0.04 0.46 70 23 8.7 19 CyberConnect2 T
3 [Prototype] 2009 Action Activision 0.84 0.35 0.00 0.12 1.31 78 83 7.8 356 Radical Entertainment M
4 [Prototype] 2009 Action Activision 0.65 0.40 0.00 0.19 1.24 79 53 7.7 308 Radical Entertainment M
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6889 Zubo 2008 Misc Electronic Arts 0.08 0.02 0.00 0.01 0.11 75 19 7.6 75 EA Bright Light E10+
6890 Zumba Fitness 2010 Sports 505 Games 1.74 0.45 0.00 0.18 2.37 42 10 5.5 16 Pipeworks Software, Inc. E
6891 Zumba Fitness: World Party 2013 Misc Majesco Entertainment 0.17 0.05 0.00 0.02 0.24 73 5 6.2 40 Zoe Mode E
6892 Zumba Fitness Core 2012 Misc 505 Games 0.00 0.05 0.00 0.00 0.05 77 6 6.7 6 Zoe Mode E10+
6893 Zumba Fitness Rush 2012 Sports 505 Games 0.00 0.16 0.00 0.02 0.18 73 7 6.2 5 Majesco Games, Majesco E10+

6825 rows × 15 columns

Exploratory Analysis¶

We then proceeded with exploratory data analysis, starting with the distribution of the video game sales data. We plotted a histogram of the global sales, which showed that the majority of games have sales less than 5 million units. There is a long tail in the distribution, indicating that a few games have sold very well, potentially skewing the data. We also plotted a bar chart of the top 10 best-selling video games, with "Wii Sports" being the highest-selling game with over 80 million units sold.
Next, we examined the video game market by year, looking at the number of games released and the total global sales per year. The plot of the number of games released showed a steady increase in the number of games over time, with a sharp increase from the 2000s onwards. The plot of total global sales per year showed a similar trend, with a sharp increase in sales from the mid-90s to the mid-2000s and a decline thereafter.
Moving on, we plotted bar charts of the top genres and publishers, with "Action" being the top genre and "Nintendo" the top publisher. We also created scatterplots to visualize the relationship between global sales and other features such as critic and user scores, finding a weak positive correlation between global sales and both scores.
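The global-sales histogram described above does not appear in a cell below; as a minimal sketch of that plot, with a small synthetic array of sales figures standing in for df['Global_Sales']:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Hypothetical sales figures (millions of units) standing in for df['Global_Sales'];
# the single large value mimics the long right tail created by mega-hits
sales = np.array([0.1, 0.3, 0.5, 0.8, 1.2, 2.0, 3.5, 4.9, 8.0, 82.74])

# Histogram of global sales
plt.figure(figsize=(10, 6))
plt.hist(sales, bins=20)
plt.title("Distribution of Global Sales")
plt.xlabel("Global Sales (millions)")
plt.ylabel("Number of Games")
plt.savefig("sales_hist.png")
plt.close()

# The claim in the text: most games sell under 5 million units
share_under_5m = (sales < 5).mean()
print(f"Share of games under 5M units: {share_under_5m:.0%}")
```

With real data the same `(df['Global_Sales'] < 5).mean()` check quantifies how much of the distribution sits left of the tail.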
In [5]:
# Get summary statistics of the dataset
df.info()

# Show the number of missing values in each column
print(df.isnull().sum())

# Plot a bar plot of the total number of games in each genre
plt.figure(figsize=(10,6))
sns.countplot(data=df, x='Genre')
plt.title('Count of Each Genre')
plt.xticks(rotation=45)
plt.show()

#Set the seaborn plot style to 'dark'
sns.set_style('dark')

#Create a scatter plot of critic scores vs. user scores, with color based on genre
plt.figure(figsize=(10,6))
sns.scatterplot(data=df, x='Critic_Score', y='User_Score', hue='Genre', alpha=0.7, palette='bright')
plt.title('Critic Scores vs. User Scores by Genre')
plt.xlabel('Critic Scores')
plt.ylabel('User Scores')

#Add a legend
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6825 entries, 0 to 6893
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          6825 non-null   object 
 1   Year          6825 non-null   int64  
 2   Genre         6825 non-null   object 
 3   Publisher     6825 non-null   object 
 4   NA_Sales      6825 non-null   float64
 5   EU_Sales      6825 non-null   float64
 6   JP_Sales      6825 non-null   float64
 7   Other_Sales   6825 non-null   float64
 8   Global_Sales  6825 non-null   float64
 9   Critic_Score  6825 non-null   int64  
 10  Critic_Count  6825 non-null   int64  
 11  User_Score    6825 non-null   float64
 12  User_Count    6825 non-null   int64  
 13  Developer     6825 non-null   object 
 14  Rating        6825 non-null   object 
dtypes: float64(6), int64(4), object(5)
memory usage: 853.1+ KB
Name            0
Year            0
Genre           0
Publisher       0
NA_Sales        0
EU_Sales        0
JP_Sales        0
Other_Sales     0
Global_Sales    0
Critic_Score    0
Critic_Count    0
User_Score      0
User_Count      0
Developer       0
Rating          0
dtype: int64
In [6]:
import plotly.express as px

# Group the dataset by year and genre, count the number of games in each group
games_by_year_genre = df.groupby(['Year', 'Genre']).size().reset_index(name='Count')

# Create the interactive bar plot
fig = px.bar(games_by_year_genre, x='Year', y='Count', color='Genre', title='Number of Games Released by Year and Genre',
             labels={'Count':'Number of Games Released'}, hover_name='Genre')

# Display the plot
fig.show()
In [7]:
top_selling = df[['Name', 'Global_Sales']].groupby('Name').sum().sort_values(by='Global_Sales', ascending=False).head(10)

plt.figure(figsize=(12,6))
plt.barh(top_selling.index[::-1], top_selling['Global_Sales'][::-1])
plt.title('Top 10 Best-Selling Games of All Time')
plt.xlabel('Global Sales (millions)')
plt.show()

Overall, our exploratory data analysis provided insights into the distribution of global video game sales, the video game market trends over time, and the relationship between video game sales and various game characteristics. These insights will guide our subsequent feature engineering and machine learning model building.

In [8]:
#Create a correlation matrix
corr_matrix = df.corr()

#Set the figure size
fig, ax = plt.subplots(figsize=(16, 10))

#Plot a heatmap of the correlation matrix
sns.heatmap(corr_matrix, cmap='coolwarm', annot=True, ax=ax)
plt.title('Correlation Matrix for Video Game Features')
plt.show()

Heatmap Correlation¶

By examining the plot, we can see that Global_Sales is positively correlated with the Critic_Score, User_Score, and Year features. This suggests that games with higher critic and user scores, as well as more recent releases, tend to sell better.

It's also interesting to see that the correlation between the sales in different regions (NA_Sales, EU_Sales, JP_Sales, Other_Sales) and Global_Sales is relatively strong, which is expected since global sales are a sum of sales in different regions.

Overall, this correlation matrix can give us some insight into which features may be important in predicting the success of a video game, and can help guide the feature selection process for our machine learning models.
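As a sketch of how a correlation matrix can guide feature selection, the correlations with the target can be ranked directly (synthetic data here, not the actual dataset, with column names borrowed from it):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the real columns, built so user score loosely
# tracks critic score and sales have a weak positive link to critic score
rng = np.random.default_rng(0)
n = 500
critic = rng.normal(70, 10, n)
user = critic / 10 + rng.normal(0, 1, n)
sales = 0.02 * critic + rng.normal(0, 1, n)

toy = pd.DataFrame({"Critic_Score": critic, "User_Score": user,
                    "Global_Sales": sales})

# Rank features by absolute correlation with the target
corr_with_target = toy.corr()["Global_Sales"].drop("Global_Sales").abs()
print(corr_with_target.sort_values(ascending=False))
```

Applying the same `df.corr()['Global_Sales']` ranking to the real dataframe gives a quick shortlist of candidate predictors before modeling.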
In [9]:
from sklearn.pipeline import make_pipeline

#Select the features and target variable
X = df[['Critic_Score', 'User_Score']].values
y = df['Global_Sales'].values

#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

#Create a pipeline with a StandardScaler and LinearRegression
model = make_pipeline(StandardScaler(), LinearRegression())

#Fit the pipeline to the training data
model.fit(X_train, y_train)

#Make predictions on the test set
y_pred = model.predict(X_test)

#Calculate and print the mean squared error and R-squared score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Mean Squared Error: ', mse)
print('R-squared score: ', r2)
Mean Squared Error:  2.6931840681914085
R-squared score:  0.08383146751958037
In [10]:
'''The mean squared error (MSE) of the linear regression model is 2.693,
which corresponds to a root mean squared error (RMSE) of about 1.64 million
units -- the typical size of the model's prediction error. The R-squared
score of 0.084 indicates that only 8.4% of the variability in the global
sales of video games can be explained by the model's predictor variables
(critic score and user score).'''
Out[10]:
"The mean squared error (MSE) of the linear regression model is 2.693,\nwhich corresponds to a root mean squared error (RMSE) of about 1.64 million\nunits -- the typical size of the model's prediction error. The R-squared\nscore of 0.084 indicates that only 8.4% of the variability in the global\nsales of video games can be explained by the model's predictor variables\n(critic score and user score)."

Split the data into training and testing sets¶

We now prepare the dataset for our machine learning model. The first step is to create dummy variables for the 'Genre' column using the get_dummies() method from Pandas, which converts categorical data into numerical data, as many machine learning algorithms require. The Genre column is then dropped and concat() is used to merge the dummy variables with the rest of the dataset.
Next, we define the input and output variables for our model. The input variables are the attributes used to predict the output variable; the intended inputs here are critic score, user score, developer, rating, year, and genre, with global sales as the output (note that the positional iloc slices in the next cell should be checked against the column order after the dummy concat, since they may not line up with these names). Finally, we split the dataset into training and testing sets using the train_test_split() method from Scikit-learn.
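The get_dummies() step can be illustrated on a toy frame (hypothetical rows, same mechanics as the Genre encoding below):

```python
import pandas as pd

# A tiny stand-in for the real dataframe
toy = pd.DataFrame({"Name": ["A", "B", "C"],
                    "Genre": ["Action", "Puzzle", "Action"]})

# One 0/1 indicator column per category
dummies = pd.get_dummies(toy["Genre"])

# Drop the original categorical column and attach the indicators
toy = pd.concat([toy.drop("Genre", axis=1), dummies], axis=1)
print(toy)
```

Each row now carries a 1 in exactly one genre column, which is the numerical form the models below can consume.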
In [11]:
dumdum = pd.get_dummies(df['Genre'])
df = df.drop('Genre', axis=1)
df = pd.concat([df, dumdum], axis=1)

#Define the input and output variables
X = df.iloc[:, 5:11].values  # NOTE: positional slice -- after dropping Genre and concatenating dummies, columns 5:11 span JP_Sales through User_Score (including Global_Sales); verify against df.columns
y = df.iloc[:, -1].values  # NOTE: column -1 is the last genre dummy, not Global_Sales; verify the intended target

#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

After splitting the data, we scale the input variables using StandardScaler(). Then we build a k-nearest neighbors regression model using KNeighborsRegressor() with 5 neighbors, evaluate it on the testing set, and print the KNN score. GridSearchCV() is used to search for the best hyperparameters for a KNN classifier: the param_grid covers several neighbor counts and both weight functions. The best parameters and score are printed at the bottom, for possible later use.

In [12]:
#Scale the input variables
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#Build the k-nearest neighbor model
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)

#Evaluate the k-nearest neighbor model
knn_score = knn.score(X_test, y_test)
print('KNN Score:', knn_score)

#Set up the parameter grid for GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7, 9, 11], 'weights': ['uniform', 'distance']}

#Create a KNN classifier object
knn = KNeighborsClassifier()

#Set up the GridSearchCV object
grid = GridSearchCV(knn, param_grid, cv=5)

#Fit the GridSearchCV object to the data
grid.fit(X_train, y_train)

#Print the best hyperparameters and the corresponding score
print("Best Hyperparameters:", grid.best_params_)
print("Best Score:", grid.best_score_)
KNN Score: -0.10310006209323208
Best Hyperparameters: {'n_neighbors': 11, 'weights': 'uniform'}
Best Score: 0.9630036630036629
In [13]:
#Set up the parameter grid for GridSearchCV
param_grid = {'n_neighbors': [3, 5, 7, 9, 11], 'weights': ['uniform', 'distance']}

#Create a KNN classifier object
knn = KNeighborsClassifier()

#Set up the GridSearchCV object
grid = GridSearchCV(knn, param_grid, cv=5)

#Fit the GridSearchCV object to the data
grid.fit(X_train, y_train)

#Print the best hyperparameters and the corresponding score
print("Best Hyperparameters:", grid.best_params_)
print("Best Score:", grid.best_score_)

#Build the k-nearest neighbor model with the best hyperparameters
knn = KNeighborsClassifier(n_neighbors=11, weights='uniform')
knn.fit(X_train, y_train)

#Evaluate the k-nearest neighbor model
knn_score = knn.score(X_test, y_test)
print('KNN Score:', knn_score)
Best Hyperparameters: {'n_neighbors': 11, 'weights': 'uniform'}
Best Score: 0.9630036630036629
KNN Score: 0.9509157509157509

In this section, a neural network model is created to predict the success of a video game based on its characteristics. First, we defined the model using Keras. The model has two layers: the first with 10 neurons and an input dimension of 6, and the second with 1 neuron and a linear activation function. The model was compiled with the mean squared error loss function and the Adam optimizer, then trained with fit() using the training data, 50 epochs, and a batch size of 16. After training, we evaluated the model on the test set using evaluate() with mean squared error as the metric.

In [14]:
#Build the neural network
model = Sequential()
model.add(Dense(10, input_dim=6, activation='relu'))
model.add(Dense(1, activation='linear'))

#Compile the neural network
model.compile(loss='mse', optimizer='adam', metrics=['mse'])

#Train the neural network
model.fit(X_train, y_train, epochs=50, batch_size=16, verbose=0)

#Evaluate the neural network
nn_score = model.evaluate(X_test, y_test, verbose=0)[1]
print('Neural Network Score:', nn_score)
Neural Network Score: 0.04527967795729637

The neural network has a mean squared error (MSE) of about 0.045 on the test set. Since the MSE is in squared units (if global sales are measured in millions of units, the MSE is in millions squared), this corresponds to an RMSE of roughly 0.21 million units, the typical size of a prediction error. The lower the MSE, the better the model's performance in predicting the target variable, and this score suggests the neural network is a suitable model for predicting the success of a video game based on its characteristics.
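Because MSE is in squared units, taking its square root gives a figure in the original units; a quick sketch using the test-set MSE printed above:

```python
import numpy as np

# Test-set MSE reported above, in (millions of units)^2
mse = 0.04527967795729637

# RMSE is in the original units (millions of units)
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.3f} million units")
```

The same conversion explains the linear regression result earlier: an MSE of 2.693 corresponds to an RMSE of about 1.64 million units.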

In [15]:
from sklearn.metrics import roc_curve, auc

y_pred = np.round(model.predict(X_test))
cm_test = confusion_matrix(y_test, y_pred)

#Get the predicted probabilities for the training and testing datasets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

#Create AUC train
fpr_train, tpr_train, thresholds_train = roc_curve(y_train, y_train_pred)
auc_train = auc(fpr_train, tpr_train)


#Create AUC test
fpr_test, tpr_test, thresholds_test = roc_curve(y_test, y_test_pred)
auc_test = auc(fpr_test, tpr_test)
print("AUC Score: ", auc_test)
43/43 [==============================] - 0s 572us/step
171/171 [==============================] - 0s 530us/step
43/43 [==============================] - 0s 572us/step
AUC Score:  0.7608490674516476

An ROC evaluation was created to test the reliability of the neural network model. The ROC curves show that the neural network is a good but not perfect fit: the model largely avoids false positives but does not always hit the true positive mark. If the AUC for the training dataset were significantly higher than the AUC for the testing dataset, it would indicate that the model is overfitting to the training data and not generalizing well to new data.

In [16]:
#Plot AUC train
plt.plot(fpr_train, tpr_train, label='Train AUC = {:.3f}'.format(auc_train))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Training Set')
plt.legend(loc='lower right')
plt.show()

#Plot AUC test
plt.plot(fpr_test, tpr_test, label='Test AUC = {:.3f}'.format(auc_test))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Testing Set')
plt.legend(loc='lower right')
plt.show()
In [17]:
from sklearn.model_selection import KFold, cross_val_score

# Define the number of folds for cross-validation
num_folds = 5

# Define the K-fold cross-validator
kfold = KFold(n_splits=num_folds, shuffle=True)

# Define a list to store the cross-validation scores
scores = []

# Loop through each fold
for train_index, test_index in kfold.split(X):
    
    # Split the dataset into training and testing sets
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # Build the neural network
    model = Sequential()
    model.add(Dense(10, input_dim=6, activation='relu'))
    model.add(Dense(1, activation='linear'))

    # Compile the neural network
    model.compile(loss='mse', optimizer='adam', metrics=['mse'])
    
    # Train the neural network
    model.fit(X_train, y_train, epochs=50, batch_size=16, verbose=0)
    
    # Evaluate the neural network on the testing set
    scores.append(model.evaluate(X_test, y_test, verbose=0)[1])

# Calculate the mean and standard deviation of the scores
mean_score = np.mean(scores)
std_score = np.std(scores)

print('Mean score:', mean_score)
print('Standard deviation:', std_score)
Mean score: 0.06395705863833427
Standard deviation: 0.034344017608940534
The mean score and standard deviation from the k-fold cross-validation of the neural network are relatively low. This suggests that the network's performance is consistent across folds and that the model is reasonably accurate at predicting the global sales of video games from their features. However, it is important to note that performance may vary depending on the specific dataset and the distribution of the features.
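The manual fold loop above can also be expressed with scikit-learn's cross_val_score helper. A sketch on synthetic data, with KNeighborsRegressor standing in for the Keras model (cross_val_score expects a scikit-learn-style estimator, which the raw Keras model is not):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem with 6 features, mirroring the input width above
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = X @ rng.normal(size=6) + rng.normal(0, 0.1, size=300)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Scoring is negated MSE by sklearn convention (higher is better)
scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), X, y,
                         cv=kfold, scoring="neg_mean_squared_error")

# Same mean/std summary convention as the manual loop above
print("Mean MSE:", -scores.mean())
print("Std dev :", scores.std())
```

One score per fold comes back in a single call, which replaces the explicit split/fit/evaluate loop.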

Finalization¶

Using the neural network created above, we can predict which genre of game would produce the largest global sales. We start with a fresh dataframe named genre_df, a copy of our df, and feed genre_df into the neural network model from above to see which genre is predicted to have the highest global sales.

Comparison to DataSet numbers¶

This portion of code is only used to compare what the model predicted against the dataset averages. The purpose is to ensure the predicted values were not simply following the data history, giving a solid comparison between the model's predictions and the history of data that was collected.
In [18]:
genre_columns = ['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle', 'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports', 'Strategy']

for genre in genre_columns:
    genre_df = df[df[genre] == 1]
    print(genre)
    print('Mean Critic Score:', genre_df['Critic_Score'].mean())
    print('Mean User Score:', genre_df['User_Score'].mean())
    print('Mean Global Sales:', genre_df['Global_Sales'].mean())
    print()

genre_mean_sales = df.groupby(genre_columns)['Global_Sales'].mean()
max_genre = genre_mean_sales.idxmax()
min_genre = genre_mean_sales.idxmin()

max_genre_name = ", ".join([genre_columns[i] for i in range(len(genre_columns)) if max_genre[i] == 1])
min_genre_name = ", ".join([genre_columns[i] for i in range(len(genre_columns)) if min_genre[i] == 1])

print("Genre with highest mean global sales: ", max_genre_name)
print("Genre with lowest mean global sales: ", min_genre_name)
Action
Mean Critic Score: 67.82883435582822
Mean User Score: 7.095828220858897
Mean Global Sales: 0.7381349693251518

Adventure
Mean Critic Score: 66.13306451612904
Mean User Score: 7.160887096774193
Mean Global Sales: 0.3256048387096776

Fighting
Mean Critic Score: 69.73280423280423
Mean User Score: 7.3018518518518505
Mean Global Sales: 0.6612433862433865

Misc
Mean Critic Score: 67.4609375
Mean User Score: 6.8497395833333306
Mean Global Sales: 1.0840104166666664

Platform
Mean Critic Score: 70.0
Mean User Score: 7.377171215880896
Mean Global Sales: 0.9374689826302726

Puzzle
Mean Critic Score: 70.69491525423729
Mean User Score: 7.2508474576271205
Mean Global Sales: 0.6686440677966101

Racing
Mean Critic Score: 69.54388984509467
Mean User Score: 7.104302925989677
Mean Global Sales: 0.8196557659208258

Role-Playing
Mean Critic Score: 72.82022471910112
Mean User Score: 7.618539325842696
Mean Global Sales: 0.704171348314606

Shooter
Mean Critic Score: 70.98148148148148
Mean User Score: 7.086458333333339
Mean Global Sales: 0.9449999999999973

Simulation
Mean Critic Score: 69.96969696969697
Mean User Score: 7.1966329966329985
Mean Global Sales: 0.6824915824915825

Sports
Mean Critic Score: 74.17073170731707
Mean User Score: 7.11081654294804
Mean Global Sales: 0.8842523860021196

Strategy
Mean Critic Score: 73.12359550561797
Mean User Score: 7.352808988764042
Mean Global Sales: 0.26071161048689145

Genre with highest mean global sales:  Misc
Genre with lowest mean global sales:  Strategy
In [19]:
# Create a new dataframe (a copy, so adding the prediction column doesn't mutate df)
genre_df = df.copy()

# Use the trained neural network to predict global sales for each genre
genre_df['Predicted_Global_Sales'] = model.predict(genre_df.iloc[:, 5:11].values)

# Find the column with the highest predicted global sales
# (caution: np.argmax on a 2D array returns a flattened index, so this only
#  maps cleanly to a column when that index falls within the 12 genre columns)
max_sales_col = genre_df.iloc[:, -13:-1].columns[np.argmax(genre_df.iloc[:, -13:-1].values)]

print('The genre with the highest predicted global sales is:', max_sales_col)
214/214 [==============================] - 0s 517us/step
The genre with the highest predicted global sales is: Role-Playing

Conclusion¶

Based on the analysis of the video game sales data, the model predicts the Role-Playing genre to have the highest global sales. By comparison, the historical averages in the dataset put the Misc genre highest and the Strategy genre lowest in mean global sales, which suggests the model's prediction is not simply echoing the data history.

It is important to note that these predictions are based on historical data and market trends and are subject to change in the future. However, the insights gained from this analysis can be useful for game developers and publishers in making informed decisions about which genres to focus on and invest in.

Overall, the Role-Playing genre appears to be a promising choice for developers and publishers, given its strong track record of global sales.